Week 4, Excercise 4, Clustering and classification

This week we are learning how to cluster and classificate. First we need to load the Boston-data, from the MASS package.

library(MASS)
data("Boston")

#Then we explore the structure and the dimensions of the data

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The data has 506 objects and 14 variables.This dataset is about housing values in suburbs of Boston. The variables are shortened, so for us to understand what they mean I have explainde the variables below. Variables: “Crim” means per capita crime rate by town. “zn” means proportion of residential land zoned for lots over 25,000 sq.ft. “indus” means proportion of non-retail business acres per town. “chas” means Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). “nox” means nitrogen oxides concentration (parts per 10 million) “rm” means average number of rooms per dwelling “age” means proportion of owner-occupied units built prior to 1940 “dis” means weighted mean of distances to five Boston employment centres “rad” means index of accessibility to radial highways “tax” means full-value property-tax rate per $10,000. “ptratio” means pupil-teacher ratio by town “black” means 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town “lstat” means lower status of the population (percent) “medv” means median value of owner-occupied homes in $1000s

Then we are exploring the data by doing a graphical overview and by showing summaries of the variables.

First I downlond few packages I could need during the excercise.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.3
## ✔ tidyr   1.0.0     ✔ dplyr   0.8.3
## ✔ readr   1.3.1     ✔ stringr 1.4.0
## ✔ tibble  2.1.3     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()

Next we are going to look at the summary of the Boston dataset

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

If you look at the variables more closely, you can see that almost all the variables are normally distributed. You can know this by seeing that median and mean values are more or less close to each other. Variables “Zn” and “Crim” are not normally distributed. The variable “Chas” is a binary variable.

Next were are going to look for the correlations of the dataset.

#First a picture of the graphical overview

pairs(Boston)

This picture is not very clear. It is a picture of the pairs plot. All the blots are so small and so near to each other thatyou can actually see only black picture. From this picture you canno’t see the correlations so next we are going to look at the correlations more closely.

#Let's make a correlation matrix and draw a correlation plot

cormatrix <- cor(Boston)
cormatrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
library(corrplot)
## corrplot 0.84 loaded
corrplot(cormatrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

My R couldn’t find a library corrplot, even tough that was instructed in the datacamp. Finally I found a package of corrplot. So after downloading that I could look at the correlations in this picture, which is more clear to me. There is strong negative correlation (big red ball), between dis-nox, dis-age and dis-indus. Meaning that moving furher from Boston employment centers the Nitrogen oxide concentration goes down, the proportion of owner-occupied units built prior to 1940 goes down. This seems clear and logical. Also lower status of the population (lstat) and median value of owner-occupied homes (medv) have strong neagtive correlation. When the percent of lower status of the population gets bigger the median value of owner-occupied homes in $1000s gets smaller. This also is logical. A positive correlation is marked with a big blue ball. So if you look at the picture, you can see that rad and tax have a strong postive correlation. This means that when the index of accessibility to radial highways rises also the full-value property-tax rate per $10,000 rises.

Standardize the dataset and print out summaries of the scaled data

Standardizing the dataset

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# The boston_scaled is a matrix and we are going to change it to a data for the future
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

After the data has been scaled you can see from the summary that all the means and medians are close to each other meaning that now they are normally distributed. This will help us in scaling this dat in a fitted model. Next we are going to change the continuous crime rate variable into a categorical variable. We need to cut the crim variable by quantiles to get the high, low and middle rates of crime into their own categories. Then we are going to drop the old crim variable from the dataset and replace it with the new crime variable.

# Creating a quantile vector

bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# creating a categorical variable crime

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
library(dplyr)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
dim(boston_scaled)
## [1] 506  14

The data has still 506 objects and 14 variables. The instructions said that we need to divide the dataset to train and test sets, so that only 80% of the data belongs to the train set.

# We are going to make the train and the test sets by choosing 80% observations to the train set and the rest of the observations are in the test set.
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)
dim(ind)
## NULL
train <- boston_scaled[ind, ]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  3.05 -0.487 0.928 3.586 -0.487 ...
##  $ indus  : num  -1.33 1.23 -1.31 -1.41 1.23 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -1.206 2.73 -0.714 -1.309 2.73 ...
##  $ rm     : num  -0.175 -1.253 0.472 0.983 -0.979 ...
##  $ age    : num  -1.072 1.116 -0.372 -1.893 0.935 ...
##  $ dis    : num  1.175 -1.046 -0.202 1.832 -1.078 ...
##  $ rad    : num  -0.867 -0.522 -0.293 -0.752 -0.522 ...
##  $ tax    : num  -0.3574 -0.0311 -1.105 -0.037 -0.0311 ...
##  $ ptratio: num  -1.7347 -1.7347 -0.0257 -0.6723 -1.7347 ...
##  $ black  : num  0.4063 -0.1651 0.4018 0.4406 -0.0528 ...
##  $ lstat  : num  -0.7314 0.0878 -0.5214 -1.1333 1.2319 ...
##  $ medv   : num  0.17 -0.319 0.638 1.345 -0.776 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 3 1 1 3 4 1 4 2 4 ...

So now the train set had 404 observations and 14 variables. This is 80% of the 506 observations we began with.

Next we are creating the test set.

test <- boston_scaled[-ind, ]
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  -0.487 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -0.437 -0.437 -0.437 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.144 -0.144 -0.144 -0.144 ...
##  $ rm     : num  -0.268 -0.794 -0.671 -0.338 -0.303 ...
##  $ age    : num  0.5657 0.0329 1.1164 0.7185 1.1164 ...
##  $ dis    : num  0.31669 0.000692 0.142544 0.312653 0.180441 ...
##  $ rad    : num  -0.637 -0.637 -0.637 -0.637 -0.637 ...
##  $ tax    : num  -0.601 -0.601 -0.601 -0.601 -0.601 ...
##  $ ptratio: num  1.18 1.18 1.18 1.18 1.18 ...
##  $ black  : num  0.256 0.375 0.415 -0.551 0.22 ...
##  $ lstat  : num  -0.3351 -0.1923 1.012 0.6479 0.0542 ...
##  $ medv   : num  -0.471 -0.471 -0.873 -0.841 -0.873 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 3 3 3 3 3 3 2 1 2 2 ...

The test set has 102 observations and 14 variables. This is 20% of the original data set.

Next were are going to fit the linear discriminant analysis on the train set

We are using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We are going to draw the LDA (bi)plot of the model.

#Fitting the linear discriminant analysis on the train set

lda.fit <-  lda(formula = crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#Target classes as numeric 

classes <- as.numeric(train$crime)

#Drawing a plot of the lda results

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

#Next we are going to save the crime categories from the test set and then remove the categorical crime variable from the test dataset. After that we are going to predict the classes with the LDA model on the test data.

#We are going to save the crime categories from the test set

correct_classes <- test$crime

library(dplyr)

#Next we are going to remove the categorial crime variable from the test dataset

test <- dplyr::select(test, -crime)

# Next we are going to predict with the LDA model on the test data

lda.pred <- predict(lda.fit, newdata = test)

#Then was asked to do a cross table of the results

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19      10        0    0
##   med_low    2      18        1    0
##   med_high   0      12       11    1
##   high       0       0        0   28

Here you can see how the model is working with the predictions. The model works well when predicting the high crime rates. The model makes errors when predicting the other crime classes.

#Next we are going load the Boston dataset again and standardize the dataset. We are going to calculate the distances between the observations. We are going to run k-means algorithm on the dataset. We are going to investigate what is the optimal number of clusters and run the algorithm again. In the end we are going to visualize the clusters and interpret the results.

library(MASS)
data("Boston")
dim(Boston)
## [1] 506  14

We now have loaded the Boston dataset again from the MASS-library. We wanted to check that we have the correct amount of observations 506 and variables 14.

Next we are going to measure the distance. I am going to use the Euklidean-distance, which is probably the most common one. K-means is old and often used clustering method. K-means counts the distance matrix automatically but you have to choose the number of clusters. I made the model with 3 clusters, because my opinion is that it worked better than 4 clusters.

scale_Boston2 <- scale(Boston)
scale_Boston2 <- as.data.frame(scale_Boston2)

#Next we are going to calculate the distances, the Euklidean-distance.

dist_eu <- dist(scale_Boston2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# K-means clustering

km <- kmeans(scale_Boston2, centers = 3)
#Plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:6], col = km$cluster)

pairs(scale_Boston2[7:13], col = km$cluster)

Next were are going to investigate what is the optimal number of clusters. There are many ways to find out the optimal number of clusters, but we will use the Total of within cluster sum of squares (WCSS) and visualise the result with a plot.

# First determine the number of clusters

k_max <- 10

# Calculate the total within sum of squares

twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})

# Next we are going to visualize the results

qplot(x = 1:k_max, y = twcss, geom = "line")

The optimal number of clusters is when the total WCSS drops radically. As you can see from the picture above this happens around, when x= 2. So the optimal number of clusters would be 2. Next we run the algorithm again with two clusters.

km <- kmeans(scale_Boston2, centers = 2)

pairs(scale_Boston2, col = km$cluster)

In this first picture all the variables are included. Because there are so many variables I think the picture is quite difficult to interpret and understand. That is why I choose to do two more plots, so that I can look at the effects more closely.

pairs(scale_Boston2[1:8], col = km$cluster)

pairs(scale_Boston2[6:13], col = km$cluster)

As you can see from the pictures above the variable “chas” doesn’t follow any pattern with any of the variables. These kind of pictures are hard for me to understand because I am doing this the very first time. I think, however, that there might be negative correlation between indus-dis, nox-dis, dis-lstat and positive correlation between indus-nox, age-nox, age-lstat.